feat(auto-routing): benchmark-driven decision engine and kilo-auto/efficient #3982
Open
iscekic wants to merge 75 commits into
…ation and table publish
Mints a short-lived (6h) user API token for a given userId, guarded by the shared internal secret over Authorization: Bearer. The decider benchmark uses this to authenticate the kilo CLI against the gateway under a real user's identity.
… container

The decider benchmark now executes each case through the stable kilo CLI (@kilocode/cli) running in a Cloudflare Container, instead of bare OpenRouter chat completions, so it measures the real agent harness.

- Container (Dockerfile + dependency-free server.mjs) spawns `kilo run --format json --auto` per case; the kilo user token is injected only as a child-process env var, never logged or written to disk.
- BenchRunnerContainer DO + wrangler containers/durable_objects/migrations.
- kilo-events.ts: pure parser for the CLI JSON event stream (text + cost), tolerant of both part.* and flattened event shapes.
- cli-runner.ts: proxies a case to the container and parses the result.
- run.ts: chunks decider cases (10/chunk) into per-(model,chunk) queue messages; fetches a short-lived user token once per message; fails fast when benchmarkUserId is unset (plus a defensive per-case guard). Classifier path unchanged.
- New benchmarkUserId config field (nullable) on BenchmarkConfig.
- vitest aliases @cloudflare/containers to a node-safe stub so unit tests can import the worker entry without the cloudflare:workers chain.
Adds a Benchmark user id input to the benchmark config editor (empty -> null), with help text noting decider runs fail until it is set. Round-trips through configToFormState/formStateToConfig.
…retries

- accept step_finish (underscore) events so per-case cost is summed
- retry once when a CLI session ends with no assistant text
- exact checks also accept the last non-empty output line
- uniform final-answer suffix on decider prompts
- /admin/debug-cli endpoint returning raw CLI events for diagnosis
- serialize CLI runs per container and run decider cases sequentially (the CLI sqlite migration is unsafe under concurrent sessions)
- add dead-letter queue and raise container instance ceiling
- redact the kilo token from captured stderr before it leaves the container
- timing-safe secret comparison and tokenSource audit field on minted tokens
- validate persisted routing tables before serving them from the admin API
- regenerate worker types with the production web base URL
- dedupe the routing-table response schema; tier boundary tests
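For readers unfamiliar with the timing-safe secret comparison mentioned in this commit, here is a minimal sketch of the idea applied to an `Authorization: Bearer` check. The helper names are hypothetical and this is not the PR's code; plain JavaScript offers no hard constant-time guarantee, so a runtime-provided primitive such as Node's `crypto.timingSafeEqual` is preferable where available.

```typescript
// Sketch only: hypothetical helpers, not the implementation in this PR.

// Compare two strings without returning early on the first mismatch, so the
// time taken does not reveal how many leading characters were correct.
function constantTimeEquals(provided: string, expected: string): boolean {
  // Fold the length difference into the accumulator instead of bailing out.
  let diff = provided.length ^ expected.length;
  for (let i = 0; i < expected.length; i++) {
    const p = i < provided.length ? provided.charCodeAt(i) : 0;
    diff |= p ^ expected.charCodeAt(i);
  }
  return diff === 0;
}

// Accept only "Bearer <secret>"; an unset or empty secret never authorizes.
function isAuthorized(authorizationHeader: string | null, secret: string): boolean {
  if (authorizationHeader === null || secret.length === 0) return false;
  const prefix = "Bearer ";
  if (!authorizationHeader.startsWith(prefix)) return false;
  return constantTimeEquals(authorizationHeader.slice(prefix.length), secret);
}
```

The empty-secret guard is a design choice of this sketch: a misconfigured deployment with no secret set should reject every request rather than accept an empty bearer token.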
ca99949 to cac57b7
Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

All previously flagged issues are now resolved; the incremental commit correctly fixes the HTTP 400 precondition response for

Resolved Issues
Files Reviewed (incremental, 2 files)

Previous Review Summary (commit 9eaae60)

Current summary above is authoritative. Previous snapshots are kept for context only.

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary
Overview
Issue Details
WARNING
Resolved Issues (all fixed in prior commits)
Incremental Changes Reviewed (commits 1a5d858 → 9eaae60)
Files Reviewed

Reviewed by claude-4.6-sonnet-20260217 · 415,474 tokens
Review guidance: REVIEW.md from base branch
… classifier dataset to per-pair coverage
…nomy coverage

Grow the decider benchmark from 30 to 76 cases so every (taskType, subtaskType) pair in the classifier taxonomy has at least 4 mechanically-checkable cases, with at least 20 cases per difficulty tier (23 low / 31 medium / 22 high).

- DeciderCase gains subtaskType; ids follow the <taskType>-<subtype>-<topic> scheme used by the classifier dataset
- Existing cases retagged with subtypes where they genuinely fit (three system-behavior investigation cases moved to planning_design/system_design, the HTTP 201 lookup to investigation/external_research, and the let-closure case reframed as refactoring/migration)
- New agentic_execution cases are self-contained file/terminal tasks deterministic in the node:22-slim container
- Tests now enforce per-pair and per-tier quotas from the classifierTaxonomy export, subtype/taskType consistency, regex compilability, and json_equal round-tripping
Remember the last served model per conversation in the decision-cache DO and keep it while it meets the current tier's accuracy threshold, unless the fresh pick is cheaper by more than the routing table's new switchCostFactor. Switching models discards provider prompt caches, so a session whose difficulty tier oscillates no longer ping-pongs between models. Decisions report a sticky flag in the response and the auto_routing_decision log line.
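The stickiness rule in this commit can be illustrated with a small sketch. It assumes that "cheaper by more than `switchCostFactor`" means the incumbent's cost exceeds the fresh pick's cost multiplied by the factor, and it simply returns null when no candidate meets the threshold; the field and function names are hypothetical, and the real decision engine may differ on all of these points.

```typescript
// Sketch only: names and the exact comparison are assumptions, not the PR's code.
interface TierCandidate {
  model: string;
  accuracy: number; // fraction of benchmark cases passed in this tier
  costUsd: number; // benchmark cost signal for this tier
}

interface StickyDecision {
  model: string;
  sticky: boolean; // true when the incumbent was kept
}

function cheapestAboveThreshold(
  candidates: TierCandidate[],
  minAccuracy: number,
): TierCandidate | null {
  const eligible = candidates.filter((c) => c.accuracy >= minAccuracy);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, c) => (c.costUsd < best.costUsd ? c : best));
}

function decideWithStickiness(
  candidates: TierCandidate[],
  minAccuracy: number,
  switchCostFactor: number,
  lastServedModel: string | null,
): StickyDecision | null {
  const fresh = cheapestAboveThreshold(candidates, minAccuracy);
  if (fresh === null) return null; // simplification: no decision

  const incumbent = candidates.find((c) => c.model === lastServedModel);
  // No incumbent, or it no longer meets this tier's threshold: take the fresh pick.
  if (incumbent === undefined || incumbent.accuracy < minAccuracy) {
    return { model: fresh.model, sticky: false };
  }
  // Switch only when the saving is large enough to justify losing the prompt cache.
  if (incumbent.costUsd > fresh.costUsd * switchCostFactor) {
    return { model: fresh.model, sticky: false };
  }
  return { model: incumbent.model, sticky: true };
}
```

With a factor of 3, an incumbent costing 0.9 is kept against a fresh pick costing 0.4, while an incumbent costing 2.0 is replaced.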
…runs, and routing table

Store the new BenchmarkConfig.switchCostFactor in the benchmark_config singleton, snapshot it into benchmark_runs at startRun, and carry the run's snapshotted value into published routing tables so the schema's required RoutingTableSchema.switchCostFactor parses on read. Regenerate the squashed D1 baseline migration, add a Switch cost factor field to the admin config form, and update test fixtures (including the apps/web decision fixtures missing the new required sticky flag).
…icient-decision-engine
…e at config save

All decider candidates are served via providers that speak every gateway chat API (in practice OpenRouter), so per-candidate supportedApiKinds was dead weight in the contracts, decision engine, D1 schema, and routing table. The one real failure mode, an admin configuring a model whose serving provider is chat-completions-only, is now rejected at config save time instead.
- never let a heuristic fallback classification re-anchor the session's sticky model (same trust rule as the classification cache)
- drop the dead ClassifierApiKindSchema export
- rename the decider pages-helper case so its id no longer collides with the classifier dataset's debug-fix-pagination-slice in shared telemetry
- trim a stale JSDoc in model-api-kinds.ts
- Inject KILO_API_URL into the benchmark container via a new KILO_CLI_API_URL worker var so the kilo CLI targets the same gateway the worker mints tokens against (prod default: api.kilo.ai).
- Add .dev.vars.example mapping both URLs to the local apps/web dev server (worker-side localhost, container-side host.docker.internal).
- Add AUTO_ROUTING_BENCHMARK_WORKER_URL to the apps/web env example so the admin panel proxies to the local benchmark worker instead of prod.
- Work around wrangler force-pulling the amd64 container egress proxy on Apple Silicon (its transparent-proxy setsockopt crashes under emulation, failing every local container start) by pinning the arm64 manifest digest via MINIFLARE_CONTAINER_EGRESS_IMAGE in the dev runner.
…meout

The kilo bin is a Node wrapper that spawns the real CLI binary as a grandchild. SIGKILLing only the wrapper orphaned the grandchild on timeout: it kept running (and spending) and held the stdout/stderr pipes open, so 'close' never fired, the case promise never resolved, and the chunk's queue message hung until the runtime cut it, then retried from case 0 and eventually dead-lettered. Observed live: a runaway agentic case ran 20+ minutes past the 180s cap and wedged the whole run.

Spawn the CLI detached so it leads its own process group, kill the group on timeout, and add an after-exit grace backstop so a stray pipe-holder can never hang a case again.
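The two mechanisms in this fix (kill the whole process group, and never let a case settle more than once or hang on 'close') can be sketched with dependency-injected helpers. The names below are hypothetical and `server.mjs` itself is not shown in this thread, so treat this as an illustration of the technique rather than the actual change.

```typescript
// Sketch only: hypothetical helpers illustrating the technique, not server.mjs.
type KillFn = (pid: number, signal: string) => void;

// On POSIX, a negative pid addresses the whole process group led by that pid,
// so the wrapper and the grandchild it spawned are signalled together.
function killProcessGroup(leaderPid: number, kill: KillFn): boolean {
  try {
    kill(-leaderPid, "SIGKILL");
    return true;
  } catch {
    // The group is already gone, so there is nothing left to kill.
    return false;
  }
}

// Resolve a case at most once, whichever of 'close', the timeout, or the
// after-exit grace timer gets there first. Later calls are ignored.
function createSettleOnce<T>(resolve: (value: T) => void): (value: T) => boolean {
  let settled = false;
  return (value: T): boolean => {
    if (settled) return false;
    settled = true;
    resolve(value);
    return true;
  };
}
```

In Node this would be wired roughly as follows: spawn with `detached: true` (on non-Windows platforms this makes the child the leader of a new process group), call `killProcessGroup(child.pid, (pid, sig) => process.kill(pid, sig))` when the timeout fires, call the settle function from the 'close' handler, and start a short grace timer in the 'exit' handler that calls the same settle function in case 'close' never arrives.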
…r latency gate

- Config gains classifierRepetitions, deciderRepetitions (1-5), and classifierMaxP95LatencyMs (null = no constraint); run rows snapshot the active repetition count and latency budget at start time.
- case_results PK extended with rep column; timed_out column added.
- model_summaries gains p95_latency_ms (nearest-rank p95 over all rows) and timeouts count.
- pickClassifierWinner enforces an optional p95 latency budget: candidates meeting both accuracy and latency are ranked by cost; when none meet the budget, falls back to lowest-p95 among accuracy-meeting models.
- classifier_winner contract surfaces the winner's p95LatencyMs.
- DECIDER_CHUNK_SIZE reduced from 10 to 5 to stay well within queue consumer wall-clock limits.
- Container server propagates timedOut flag through ContainerRunResponse and CliRunResult so timed-out cases are recorded in D1.
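As an illustration of the nearest-rank p95 and the latency-gated winner selection described in this commit, here is a minimal sketch. The summary shape and the function names are hypothetical (the real `pickClassifierWinner` signature is not shown in this thread), and the "most accurate if none meet the accuracy threshold" fallback is taken from the PR description.

```typescript
// Sketch only: hypothetical shapes and names, not the PR's winner.ts.
interface ClassifierSummary {
  model: string;
  accuracy: number;
  costUsd: number;
  p95LatencyMs: number;
}

// Nearest-rank percentile: sort ascending and take the value at rank ceil(0.95 * n).
function nearestRankP95(latenciesMs: number[]): number | null {
  if (latenciesMs.length === 0) return null;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100); // integer math avoids float drift
  return sorted[rank - 1] ?? null;
}

function pickWinnerSketch(
  summaries: ClassifierSummary[],
  minAccuracy: number,
  maxP95LatencyMs: number | null, // null = no latency constraint
): ClassifierSummary | null {
  if (summaries.length === 0) return null;
  const accurate = summaries.filter((s) => s.accuracy >= minAccuracy);
  if (accurate.length === 0) {
    // Nobody meets the accuracy threshold: fall back to the most accurate model.
    return summaries.reduce((best, s) => (s.accuracy > best.accuracy ? s : best));
  }
  const withinBudget =
    maxP95LatencyMs === null
      ? accurate
      : accurate.filter((s) => s.p95LatencyMs <= maxP95LatencyMs);
  if (withinBudget.length > 0) {
    // Accurate and fast enough: the cheapest wins.
    return withinBudget.reduce((best, s) => (s.costUsd < best.costUsd ? s : best));
  }
  // Accurate but none within budget: the lowest p95 wins.
  return accurate.reduce((best, s) => (s.p95LatencyMs < best.p95LatencyMs ? s : best));
}
```

For 20 samples the nearest-rank p95 is the 19th smallest value; for 3 samples it is the largest one.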
…test gaps

- Migration 0001: replace "rep"/"timed_out" column refs in INSERT...SELECT with literal 0,0: old table lacks those columns; D1 silently degrades double-quoted unknowns to string literals, corrupting NOT NULL integer rows.
- Contracts: add BenchmarkConfigSchema defaults test (classifierRepetitions=1, deciderRepetitions=1, classifierMaxP95LatencyMs=1000 when omitted).
- Benchmark: extract buildDeciderMessages() pure function; add fan-out test asserting models × reps × ceil(76/5) messages each carrying the correct rep.
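The fan-out arithmetic asserted by that test (models × reps × ceil(76/5) messages) can be sketched as below. This is not the real `buildDeciderMessages()`; the message shape, the function name, and the zero-based `rep` are assumptions made for illustration.

```typescript
// Sketch only: hypothetical message shape; the real buildDeciderMessages() may differ.
interface DeciderMessageSketch {
  model: string;
  rep: number; // assumed zero-based here
  chunkIndex: number;
  caseIds: string[];
}

function fanOutDeciderMessages(
  models: string[],
  caseIds: string[],
  repetitions: number,
  chunkSize: number,
): DeciderMessageSketch[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  const messages: DeciderMessageSketch[] = [];
  for (const model of models) {
    for (let rep = 0; rep < repetitions; rep++) {
      // One queue message per (model, rep, chunk); the last chunk may be short.
      for (let start = 0, chunkIndex = 0; start < caseIds.length; start += chunkSize, chunkIndex++) {
        messages.push({ model, rep, chunkIndex, caseIds: caseIds.slice(start, start + chunkSize) });
      }
    }
  }
  return messages;
}
```

With 76 cases and a chunk size of 5 there are 16 chunks per (model, rep), the last holding a single case, so 2 models at 3 repetitions produce 96 messages.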
…olumns

Add classifier/decider repetitions (1–5) and classifierMaxP95LatencyMs inputs to the Benchmark Config card; add p95 latency and Timeouts columns to the run summaries table; update test fixtures with new fields.
Set both RunSummariesTable colSpan values back to 6 to match the outer BenchmarkRunsTable's 6-column header (chevron, Kind, Status, Started, Completed, Error). Export configToFormState and formStateToConfig for unit testing and add focused tests covering null-config defaults, round-trip preservation of repetitions/latency fields, and empty-string classifierMaxP95LatencyMs coercing to null.
…icient-decision-engine
…ests

Main merged PR #4004 which deleted the morph provider. The two test files that exercised the rejection branch of modelServesAllGatewayChatApis used morph as the only available Kilo-exclusive model on a chat_completions-only gateway. With morph gone, no real catalog entry satisfies that condition.

Both test files now stub findKiloExclusiveModel via jest.mock/requireActual so that the marker id 'test-exclusive/alibaba-only' returns a KiloExclusiveModel with gateway: 'alibaba'. The real PROVIDERS.ALIBABA definition supports only chat_completions, so the rejection path is exercised without relying on any specific provider file being present in the catalog.
…onfig

The POST /admin/runs handler let startRun's "config not set" precondition error propagate to the global error handler, surfacing a client-side precondition as HTTP 500. Guard the null config in the route handler, mirroring the /admin/debug-cli pattern, and return 400 instead.
Benchmark-driven decision engine and `kilo-auto/efficient`

Summary

Adds a benchmark-driven model-routing pipeline behind a new hidden virtual model, `kilo-auto/efficient`: route each request to the cheapest model that is proven (by our own benchmarks) to be accurate enough for the request's difficulty.

Three moving parts:

- `services/auto-routing-benchmark` (new Cloudflare Worker): runs two deterministic benchmarks (classifier prompt replay via OpenRouter, and decider golden tasks through the real `kilo` CLI in a Cloudflare Container), writes normalized results to D1, and publishes a routing table (per-difficulty-tier ranked candidates) plus a classifier winner.
- `services/auto-routing` (existing worker): `/decide` classifies the request, derives a difficulty tier, and picks the cheapest above-threshold model from the routing table, with session-sticky decisions held in a Durable Object.
- `apps/web` (gateway): exposes `kilo-auto/efficient`, blocks on `/decide` with a 2s timeout, falls back to the balanced Qwen default, bills the classifier LLM cost to the requesting user, and adds an admin panel for the whole pipeline.

Shared classifier code (prompt, parsing, fallback, taxonomy, tier derivation, routing-table schema) moves into the new `packages/auto-routing-contracts` package, so the benchmark replays exactly the code the production worker executes.

Architecture
Benchmark worker (`services/auto-routing-benchmark`)

Classifier benchmark

Replays 72 normalized classifier inputs through OpenRouter using the exact production classifier code (`@kilocode/auto-routing-contracts/classifier`). Each output is graded per-field against a hand-labeled expectation via `CLASSIFIER_FIELD_WEIGHTS` (`src/grading.ts`): taskType 0.25, reasoningComplexity 0.20, contextComplexity 0.15, executionMode 0.15, subtaskType 0.10, requiresTools 0.10, riskLevel 0.05. Heuristic-fallback outputs score 0.

The winner (`src/winner.ts`) is the cheapest model meeting the run's accuracy threshold (most accurate one if none do). It feeds the worker's classifier-model resolution chain (below).

Decider benchmark
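Looking back at the classifier grading described under Classifier benchmark: the per-field weights sum to 1.0, so a fully correct classification scores 1 and each wrong field subtracts its weight. The sketch below uses the weights quoted above but assumes exact-match comparison per field and a hypothetical classification shape; the real `src/grading.ts` may compare fields differently.

```typescript
// Sketch only: weights are from the PR description; the comparison rule and
// the value types are assumptions.
const FIELD_WEIGHTS_SKETCH = {
  taskType: 0.25,
  reasoningComplexity: 0.2,
  contextComplexity: 0.15,
  executionMode: 0.15,
  subtaskType: 0.1,
  requiresTools: 0.1,
  riskLevel: 0.05,
} as const;

type ClassifierField = keyof typeof FIELD_WEIGHTS_SKETCH;
type ClassificationSketch = Record<ClassifierField, string | boolean>;

function gradeClassification(
  actual: ClassificationSketch,
  expected: ClassificationSketch,
  isHeuristicFallback: boolean,
): number {
  // A heuristic fallback means the model produced no usable output: score 0.
  if (isHeuristicFallback) return 0;
  let score = 0;
  for (const field of Object.keys(FIELD_WEIGHTS_SKETCH) as ClassifierField[]) {
    if (actual[field] === expected[field]) score += FIELD_WEIGHTS_SKETCH[field];
  }
  return score;
}
```

For example, an output that gets everything right except `subtaskType` and `riskLevel` scores 1 - 0.10 - 0.05 = 0.85 (up to floating-point rounding).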
Runs 76 golden tasks per candidate model through the real `kilo` CLI (`@kilocode/cli`) inside a Cloudflare Container (`container/Dockerfile` + `container/server.mjs`, node:22-slim, standard-2). Grading is purely mechanical: `exact` / `contains_all` / `regex` / `json_equal` checks, no LLM judges; golden answers were hand-derived and mechanically re-verified where executable. Cases include genuinely agentic tasks performed with file/terminal tools inside the container (deterministic: no repo, no network).

Execution details:

- runId:model:chunk) so models/chunks never share state. CLI runs are serialized per instance (the CLI's sqlite state is not safe under concurrent first runs); a `/warmup` endpoint absorbs the one-time sqlite migration before the case loop.
- `reasoningEffort` is forwarded as the CLI's `--variant`, so the benchmark measures the model exactly as it will be served.
- apps/web's internal endpoint (token only ever lives in a child-process env var, never logged or written to disk).

Datasets
Both datasets cover all 18 (taskType, subtaskType) taxonomy pairs with at least 4 cases per pair, enforced by tests (`src/datasets/*.test.ts`). Decider cases each carry exactly one difficulty tier with at least 4 distinct task types per tier.

D1 schema (`src/db-schema.ts`)

Fully normalized, zero JSON blob columns, composite-PK-only access:

- `benchmark_config` + `config_classifier_models` + `config_decider_models`: admin config (incl. per-decider-model `reasoning_effort`).
- `benchmark_runs`: carries a config snapshot (`min_accuracy`, `switch_cost_factor`, `max_concurrency`, `benchmark_user_id`) taken at `startRun` time, so mid-run admin edits can't skew results. All job processing and publishing reads the snapshot, never live config.
- `run_models`: which models were enqueued vs. skipped, with the pinned `reasoning_effort` snapshot.
- `case_results`: per (run, model, case) score/latency/cost plus diagnostics (classifier fallback reason, CLI exit code/output prefix/event tail).
- `model_summaries`: per (run, model, tier) aggregates. Carried summaries: models with prior results are skipped on new runs (their latest summaries are copied in with `carried=true`), so re-runs only spend on new candidates; the admin can force a full re-run.
- `routing_tables` + `routing_table_candidates`: published tables, queryable history.

Single squashed baseline migration (`migrations/0000_amused_shard.sql`), applied by a `predeploy` script (`wrangler d1 migrations apply --remote`) which the CI deploy workflow now runs for any worker that defines one (`.github/workflows/deploy-workers.yml`).

Publishing
On run completion the worker builds the routing table from the run's own snapshot (`src/routing-table-builder.ts`): per tier, candidates are ranked best-bang-for-buck (above-threshold cheapest-first, below-threshold by accuracy). Models with zero graded cases or no cost signal in a tier are excluded; if any tier ends up empty the publish is skipped and the previous table stays live (schema enforces `.min(1)` per tier). Publishing only deletes the KV cache keys so the auto-routing worker repopulates from D1 on the next read.

Decision engine (`services/auto-routing`)

`/decide` (existing endpoint, now decision-capable):

- `deriveDifficultyTier` in contracts: reasoning complexity dominates at 2x weight; context, execution mode, and risk nudge borderline cases).
- `src/decision-engine.ts`): cheapest above-threshold candidate for the tier, unless the session has an incumbent.

Session stickiness: the conversation's Durable Object remembers the last served model. The incumbent is kept while it still meets the tier's accuracy threshold, unless the fresh pick is cheaper by more than the table's `switchCostFactor`. Rationale (commented in code): a model switch discards the provider's prompt cache, and rebuilding it costs full-price input tokens (4–10x cache-read rates) on a context that dominates agent-session spend; switching only pays off when recurring per-turn savings clearly exceed that one-time penalty. Sticky state trusts only real classifier output: heuristic fallbacks never re-anchor the session's model.

Routing table access: a read-through chain from isolate-local 60s TTL cache → KV (1h TTL, shared `AUTO_ROUTING_CONFIG` namespace) → service binding to the benchmark worker's D1-backed `/admin/routing-table`. Corrupt KV values are treated as misses; origin failures degrade to null (no decision) rather than erroring the request.

Classifier model resolution (`src/classifier-config.ts`): admin KV override → benchmark winner (same KV read-through, derived on read) → built-in default `google/gemini-2.5-flash-lite`. A benchmark-origin failure never discards a healthy override.

Gateway (`apps/web`)

- `kilo-auto/efficient` (`src/lib/ai-gateway/auto-model/index.ts`): hidden virtual model (excluded from `/models`, usable by id) with the same catalog properties as balanced; intended to eventually replace it, hidden while validated on Kilo team traffic.
- `auto-model/resolution.ts` + `auto-routing-decision.ts`): blocks on `/decide` with a 2s timeout; on a decision, serves the decided model and applies its pinned `reasoningEffort` so it runs under the same conditions the benchmark measured. On null/timeout/error, serves `BALANCED_QWEN_MODEL`: an efficient request never degrades below balanced.
- `/decide` is billed to the requesting user as a separate microdollar usage row (`requested_model: kilo-auto/efficient`), so routing overhead is visible and attributed rather than absorbed.
- `admin/auto-routing/BenchmarksSection.tsx`, proxied through admin API routes with the internal secret): config editor (classifier/decider model lists, per-decider `reasoningEffort`, `minAccuracy`, `switchCostFactor`, `maxConcurrency`, `benchmarkUserId`), run triggers with a force-rerun toggle, run history, and the live published routing table.
- `admin/api/auto-routing/benchmark-config/route.ts`): every decider model must be servable on all gateway chat API kinds (`chat_completions`, `responses`, `messages`) by the provider the gateway would route it to; the routing table deliberately carries no per-protocol metadata, so this invariant is enforced at write time.
- `api/internal/auto-routing-benchmark/token/route.ts`): POST gated by `INTERNAL_API_SECRET`; mints a 6h full user API token (`tokenSource: auto-routing-benchmark`) for the decider CLI's identity/billing.

Design properties
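The gateway behaviour described above (block on `/decide` for up to 2s, then fall back to balanced on null, timeout, or error) can be sketched as follows. The types, function names, and the placeholder fallback id are hypothetical; the actual value of `BALANCED_QWEN_MODEL` is not shown in this thread, and a real implementation would likely also abort the in-flight request.

```typescript
// Sketch only: hypothetical names; the placeholder stands in for BALANCED_QWEN_MODEL.
const BALANCED_FALLBACK_PLACEHOLDER = "balanced-fallback-placeholder";

interface RoutingDecisionSketch {
  model: string;
  reasoningEffort: string | null; // pinned effort the benchmark measured
}

interface ServedModelSketch {
  model: string;
  reasoningEffort: string | null;
  routed: boolean; // false when the balanced fallback is served
}

// Race the decision against a timer; timeouts and errors both become null.
async function decideWithTimeout(
  decide: () => Promise<RoutingDecisionSketch | null>,
  timeoutMs = 2000,
): Promise<RoutingDecisionSketch | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    return await Promise.race([decide(), timeout]);
  } catch {
    return null;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// A null decision never degrades the request below the balanced default.
function resolveServedModel(decision: RoutingDecisionSketch | null): ServedModelSketch {
  if (decision === null) {
    return { model: BALANCED_FALLBACK_PLACEHOLDER, reasoningEffort: null, routed: false };
  }
  return { model: decision.model, reasoningEffort: decision.reasoningEffort, routed: true };
}
```

Usage would be `resolveServedModel(await decideWithTimeout(() => callDecide(request)))`, where `callDecide` is whatever function performs the `/decide` call (also hypothetical here).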
`/decide` returns null decisions until a benchmark publishes one, and the gateway serves balanced fallbacks. There is no default benchmark config: runs refuse to start until an admin saves one (and decider runs additionally fail fast without a `benchmarkUserId`).

Infrastructure
- `auto-routing-benchmark` in region EEUR, primary in Frankfurt (colo FRA, next to the backend; verified via `wrangler d1 info`).
- `auto-routing-benchmark-jobs` (max_concurrency 4, max_retries 2) + DLQ `auto-routing-benchmark-dlq`.
- `auto-routing-benchmark-runner` (standard-2, max 40 instances), image built and pushed by `wrangler deploy`.
- `auto-routing` → `auto-routing-benchmark`; shared KV namespace `AUTO_ROUTING_CONFIG`.

Post-merge deploy / cutover checklist
- `kilo-auto/efficient`, admin panel, token mint).
- `CLOUDFLARE_API_TOKEN` needs D1 edit permission (the deploy will surface it if missing).
- `benchmarkUserId` is required for decider runs (consider a dedicated service account; its account is billed for CLI usage); suggested `switchCostFactor` starting value: 3.
- `classifier_model` KV override (currently set to flash-lite) if the benchmark winner should drive classifier selection.

Reviewer notes
- `exact` decider check also accepts the last non-empty output line (`src/grading.ts`): agent harnesses sometimes prepend preamble despite instructions; wrong answers fail either way.
- `@kilocode/cli@latest` is resolved at image build time, i.e. each deploy pins whatever was `latest` then; re-deploy to pick up a newer CLI.
- `INTERNAL_API_SECRET` and can mint for any user id; scoping it to the configured benchmark user is a reasonable follow-up.
- `chat_completions` only (the CLI's path). Config-save validation guarantees candidates are servable on all three chat API kinds, but accuracy is only measured on one.
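To make the first reviewer note concrete, here is a rough sketch of the four mechanical check kinds, including the relaxed `exact` rule that also accepts the last non-empty output line. The check shapes, trimming behaviour, and key-order-insensitive JSON comparison are assumptions for illustration; `src/grading.ts` is not shown in this thread and may differ.

```typescript
// Sketch only: hypothetical check shapes, not the PR's grading.ts.
type CheckSketch =
  | { kind: "exact"; expected: string }
  | { kind: "contains_all"; expected: string[] }
  | { kind: "regex"; pattern: string; flags?: string }
  | { kind: "json_equal"; expected: unknown };

function lastNonEmptyLine(output: string): string {
  const lines = output.split(/\r?\n/).map((l) => l.trim()).filter((l) => l.length > 0);
  return lines.pop() ?? "";
}

// Recursively sort object keys so key order does not affect equality.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) sorted[key] = canonicalize(source[key]);
    return sorted;
  }
  return value;
}

function passesCheck(check: CheckSketch, output: string): boolean {
  const trimmed = output.trim();
  switch (check.kind) {
    case "exact":
      // Accept the whole output, or just its last non-empty line (preamble tolerated).
      return trimmed === check.expected || lastNonEmptyLine(output) === check.expected;
    case "contains_all":
      return check.expected.every((needle) => output.includes(needle));
    case "regex":
      return new RegExp(check.pattern, check.flags).test(trimmed);
    case "json_equal":
      try {
        const parsed: unknown = JSON.parse(trimmed);
        return JSON.stringify(canonicalize(parsed)) === JSON.stringify(canonicalize(check.expected));
      } catch {
        return false; // output was not valid JSON
      }
  }
}
```

Under these assumptions, an answer of `42` preceded by a chatty preamble still passes an `exact` check for `42`, while an output whose last line is `41` fails either way.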